{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该notebook为指导在树算法中实现一些额外的参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.sys.path.append(os.path.dirname(os.path.abspath('.')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## max_depth\n",
    "```max_depth```参数是控制树生成的最重要的参数之一。在递归生成树时没想到什么好办法可以直接获取当前树的深度，不过可以通过增加一个叶子节点数的全局变量来获取当前树的深度，原理在于深度为$d$的树，他的最小叶子节点数与最大叶子节点数是可以通过公式算出来的。以下代码全部引自之前的notebook，不再赘述。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import numpy as np\n",
    "# from datasets.dataset import load_breast_cancer\n",
    "# data=load_breast_cancer()\n",
    "# X,Y=data.data,data.target\n",
    "# del data\n",
    "\n",
    "# from model_selection.train_test_split import train_test_split\n",
    "# X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)\n",
    "\n",
    "# training_data=np.c_[X_train,Y_train]\n",
    "# testing_data=np.c_[X_test,Y_test]\n",
    "\n",
    "# def Gini(data, y_idx=-1):\n",
    "#     K = np.unique(data[:, y_idx])\n",
    "#     n_sample = len(data)\n",
    "#     gini_idx = 1 - \\\n",
    "#         np.sum([np.square(len(data[data[:, y_idx] == k])/n_sample) for k in K])\n",
    "\n",
    "#     return gini_idx\n",
    "\n",
    "# def BinSplitData(data,f_idx,f_val):\n",
    "#     data_left=data[data[:,f_idx]<=f_val]\n",
    "#     data_right=data[data[:,f_idx]>f_val]\n",
    "#     return data_left,data_right\n",
    "\n",
    "# from scipy import stats\n",
    "\n",
    "# def Test(data, criteria='gini', min_samples_split=5, min_samples_leaf=5, min_impurity_decrease=0.0):\n",
    "#     n_sample, n_feature = data.shape\n",
    "\n",
    "#     if n_sample < min_samples_split or len(np.unique(data[:,-1]))==1:\n",
    "#         return None, stats.mode(data[:, -1])[0][0]\n",
    "\n",
    "#     Gini_before = Gini(data)\n",
    "#     best_gain = 0\n",
    "#     best_f_idx = None\n",
    "#     best_f_val = stats.mode(data[:, -1])[0][0]\n",
    "\n",
    "#     for f_idx in range(n_feature-1):\n",
    "#         for f_val in np.unique(data[:, f_idx]):\n",
    "#             data_left, data_right = BinSplitData(data, f_idx, f_val)\n",
    "\n",
    "#             if len(data_left) < min_samples_leaf or len(data_right) < min_samples_leaf:\n",
    "#                 continue\n",
    "\n",
    "#             Gini_after = len(data_left)/n_sample*Gini(data_left) + \\\n",
    "#                 len(data_right)/n_sample*Gini(data_right)\n",
    "#             gain = Gini_before-Gini_after \n",
    "\n",
    "#             if gain < min_impurity_decrease or gain < best_gain:\n",
    "#                 continue\n",
    "#             else:\n",
    "#                 best_gain = gain\n",
    "#                 best_f_idx, best_f_val = f_idx, f_val\n",
    "\n",
    "#     return best_f_idx, best_f_val"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在递归生成树的函数之前增加一个全局变量：```nodes```，用于实时监控CART树的叶节点数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'cut_f': 23, 'cut_val': 880.8, 'left': 1.0, 'right': 0.0} 3\n"
     ]
    }
   ],
   "source": [
    "# nodes=0\n",
    "# max_depth=1\n",
    "\n",
    "# def CART(data,criteria='gini',min_samples_split=5,min_samples_leaf=5,min_impurity_decrease=0.0):\n",
    "#     best_f_idx,best_f_val=Test(data,criteria,min_samples_split,min_samples_leaf,min_impurity_decrease)\n",
    "    \n",
    "#     tree={}\n",
    "#     tree['cut_f']=best_f_idx\n",
    "#     tree['cut_val']=best_f_val\n",
    "    \n",
    "#     nonlocal nodes\n",
    "#     nodes+=1\n",
    "    \n",
    "#     if best_f_idx==None:\n",
    "#         return best_f_val\n",
    "    \n",
    "#     # 节点数超过最大深度的限制，也要返回叶节点，叶节点的值为当前数据中的目标值众数\n",
    "#     if nodes>=2**max_depth:\n",
    "#         return stats.mode(data[:, -1])[0][0]\n",
    "    \n",
    "#     data_left,data_right=BinSplitData(data,best_f_idx,best_f_val)\n",
    "#     tree['left']=CART(data_left,criteria,min_samples_split,min_samples_leaf,min_impurity_decrease)\n",
    "#     tree['right']=CART(data_right,criteria,min_samples_split,min_samples_leaf,min_impurity_decrease)\n",
    "    \n",
    "#     return tree\n",
    "\n",
    "# tree=CART(training_data)\n",
    "\n",
    "# # CART树存储为字典形式，将其字符串化后，每一个'left'代表左叉树枝，每一个'right'代表右叉树枝\n",
    "# # 节点数之和等于树枝数+1\n",
    "# tree_str=str(tree)\n",
    "# edge=tree_str.count('left')+tree_str.count('right')\n",
    "# assert edge+1==nodes\n",
    "\n",
    "# print(tree,nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sample_weight\n",
    "该参数用于控制样本在树的分裂时所占的权重，实质就是不同样本对于纯净度的贡献。因为sklearn中该参数是位于```fit```方法中，所以实现该参数需要结合```.py```工程文件来分析。\n",
    "\n",
    "```DecisionTreeClassifier.py```中的```fit```方法只有两步：\n",
    "1. 将```X_train```与```Y_train```拼接起来便于共同操作\n",
    "2. 递归调用```CART```方法\n",
    "\n",
    "而```CART```方法中又调用了```Test```方法与```BinSplitData```方法，所以需要修改的部分就是这四个方法。\n",
    "\n",
    "思路：\n",
    "- 在```fit```方法中接受一个```sample_weight```参数，同样与```X_train```以及```Y_train```拼到一起\n",
    "- Gini指数的计算方式需要改变，因为样本权重改变了$p_{k}$值，同时还要更改做test时加权Gini指数的计算方式\n",
    "- 做test时注意不要扫描```weight```列与```Y```列\n",
    "\n",
    "下面依次展示各方法中更改的代码部分："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit(self, X_train, Y_train，sample_weight=None):\n",
    "    # ...\n",
    "    sample_weight=sample_weight if sample_weight else np.array([1/len(X_train)]*len(X_train))\n",
    "    data = np.c_[X_train,sample_weight, Y_train]    # 权重为倒数第二列，目标值为最后一列\n",
    "    # ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def __Gini(self, data, y_idx=-1):\n",
    "    # ...\n",
    "    gini_idx = 1 - np.sum([np.square(np.sum(data[data[:, y_idx] == k][:,-2]) / np.sum(data[:,-2])) for k in K])\n",
    "    # ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def __Test(self, data):\n",
    "    # ...\n",
    "    n_sample, n_feature = data.shape\n",
    "    n_feature-=-2   # 除去第一列与最后一列\n",
    "    # ...\n",
    "    # 加权Gini的计算方式需要更改，改成数据子集的权重和\n",
    "    Gini_after = np.sum(data_left[:,-2])/np.sum(data[:,-2]) * self.__Gini(data_left) + \\\n",
    "                 np.sum(data_right[:,-2])/np.sum(data[:,-2]) * self.__Gini(data_right)\n",
    "    # ..."
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
